Abstract

It is shown that under suitable regularity conditions, differential entropy is $O(\sqrt{n})$-Lipschitz
as a function of probability distributions on $\mathbb{R}^n$ with respect to the quadratic Wasserstein
distance. Under similar conditions, (discrete) Shannon entropy is shown to be $O(n)$-Lipschitz in
distributions over the product space with respect to Ornstein's $\bar{d}$-distance (Wasserstein distance
corresponding to the Hamming distance). These results together with Talagrand's and Marton's
transportation-information inequalities allow one to replace the unknown multi-user interference
with its i.i.d. approximations. As an application, a new outer bound for the two-user Gaussian
interference channel is proved, which, in particular, settles the "missing corner point" problem
of Costa (1985).


1 Introduction 

Let $X$ and $\tilde{X}$ be random vectors in $\mathbb{R}^n$. We ask the following question: If the distributions of $X$
and $\tilde{X}$ are close in a certain sense, can we guarantee that their differential entropies are close as well?
For example, one can ask whether

$$D(P_X\|P_{\tilde{X}}) = o(n) \implies |h(X) - h(\tilde{X})| = o(n). \tag{1}$$

One motivation comes from multi-user information theory, where frequently one user causes
interference to the other and in proving the converse one wants to replace the complicated non-i.i.d.
interference by a simpler i.i.d. approximation. As a concrete example, we consider the so-called
"missing corner point" problem in the capacity region of the two-user Gaussian interference
channels (GIC) [Cos85a]. Perhaps due to the explosion in the number of interfering radio devices,
this problem has attracted renewed attention recently [Cos11, BPS14, CR15, RC15]. For further
information on the capacity region of GIC and especially the problem of corner points, we refer to a
comprehensive account just published by Igal Sason [Sas15].

Mathematically, the key question for settling the "missing corner point" is the following: Given
independent $n$-dimensional random vectors $X_1, X_2, G_2, Z$ with the latter two being Gaussian, is it
true that

$$D(P_{X_2+Z}\|P_{G_2+Z}) = o(n) \implies |h(X_1+X_2+Z) - h(X_1+G_2+Z)| = o(n). \tag{2}$$
To illustrate the nature of the problem, we first note that the answer to (1) is in fact negative,
as the counterexample of $X \sim \mathcal{N}(0, 2I_n)$ and $\tilde{X} \sim \frac{1}{2}\mathcal{N}(0, I_n) + \frac{1}{2}\mathcal{N}(0, 2I_n)$ demonstrates, in which
case the divergence is $D(P_X\|P_{\tilde{X}}) \le \log 2$ but the differential entropies differ by $\Theta(n)$. Therefore
even for very smooth densities the difference in entropies is not controlled by the divergence. The
situation for discrete alphabets is very similar, in the sense that the gap of Shannon entropies
cannot be bounded by divergence in general (with essentially the same counterexample as that in
the continuous case: $X$ and $\tilde{X}$ being uniform on one and two Hamming spheres respectively).
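The counterexample above can be checked with a few lines of arithmetic. The following is a minimal sketch in nats; the function names are illustrative and not from the paper. The mixture entropy is only upper-bounded, using the standard fact that the differential entropy of a mixture is at most the average component entropy plus the entropy of the mixing index. The divergence bound uses the pointwise inequality $p_{\tilde{X}} \ge \frac{1}{2} p_X$.

```python
import math


def entropy_gap_lower_bound(n: int) -> float:
    """Lower bound (nats) on h(X) - h(X_tilde) for X ~ N(0, 2 I_n) and
    X_tilde ~ 0.5 N(0, I_n) + 0.5 N(0, 2 I_n)."""
    h_wide = 0.5 * n * math.log(2 * math.pi * math.e * 2.0)    # h(N(0, 2 I_n))
    h_narrow = 0.5 * n * math.log(2 * math.pi * math.e * 1.0)  # h(N(0, I_n))
    # h(mixture) <= average component entropy + ln 2 (entropy of the fair index)
    h_mix_upper = 0.5 * h_narrow + 0.5 * h_wide + math.log(2.0)
    return h_wide - h_mix_upper


def divergence_upper_bound() -> float:
    """D(P_X || P_X_tilde) <= ln 2, since p_X_tilde >= 0.5 * p_X pointwise."""
    return math.log(2.0)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, entropy_gap_lower_bound(n), divergence_upper_bound())
```

The gap grows like $(n/4)\ln 2$ while the divergence stays below $\ln 2$, which is the $\Theta(n)$ versus $O(1)$ behavior claimed in the text.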

The rationale of the above discussion is two-fold: a) Certain regularity conditions of the
distributions must be imposed; b) Distances other than KL divergence might be more suited for bounding
the entropy difference. Correspondingly, the main contribution of this paper is the following: Under 
suitable regularity conditions, the difference in entropy (in both continuous and discrete cases) can 
in fact be bounded by the Wasserstein distance, a notion originating from optimal transportation 
theory which turns out to be the main tool of this paper. 

We start with the definition of the Wasserstein distance on the Euclidean space. Given
probability measures $P, Q$ on $\mathbb{R}^n$, define their $p$-Wasserstein distance ($p \ge 1$) as

$$W_p(P,Q) \triangleq \inf \left(\mathbb{E}[\|X-Y\|^p]\right)^{1/p}, \tag{3}$$

where $\|\cdot\|$ denotes the Euclidean distance and the infimum is taken over all couplings of $P$ and
$Q$, i.e., joint distributions $P_{XY}$ whose marginals satisfy $P_X = P$ and $P_Y = Q$. The following dual
representation of the $W_1$ distance is useful:

$$W_1(P,Q) = \sup_{\mathrm{Lip}(f)\le 1} \int f\,dP - \int f\,dQ. \tag{4}$$
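To make definition (3) concrete, here is a minimal sketch for the one-dimensional empirical case. It assumes two samples of equal size on the real line, where pairing the sorted samples is an optimal coupling for any $p \ge 1$. The helper name `wasserstein_1d` is hypothetical and the sketch does not cover the general $\mathbb{R}^n$ setting of the paper.

```python
def wasserstein_1d(xs, ys, p=2.0):
    """Empirical p-Wasserstein distance between two equal-size samples on R.

    For the uniform empirical measures on xs and ys, matching the i-th
    smallest point of xs with the i-th smallest point of ys attains the
    infimum in (3) when p >= 1.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Average transport cost under the sorted (monotone) coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1.0 / p)


if __name__ == "__main__":
    print(wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0], p=2.0))
```

Since $W_1 \le W_2$ by Jensen's inequality, the same function called with `p=1.0` never returns a larger value than with `p=2.0` on the same inputs.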

Similar to (1), it is easy to see that in order to control $|h(X) - h(\tilde{X})|$ by means of $W_2(P_X, P_{\tilde{X}})$,
one necessarily needs to assume some regularity properties of $P_X$ and $P_{\tilde{X}}$; otherwise, choosing
one to be a fine quantization of the other creates an infinite gap between differential entropies, while
keeping the $W_2$ distance arbitrarily small. Our main result in Section 2 shows that under moment
constraints and certain conditions on the densities (which are in particular satisfied by convolutions
with Gaussians), various information measures such as differential entropy and mutual information
on $\mathbb{R}^n$ are in fact $\sqrt{n}$-Lipschitz continuous with respect to the $W_2$-distance. These results have
natural counterparts in the discrete case where the Euclidean distance is replaced by Hamming
distance (Section 4).

Furthermore, transportation-information inequalities, such as those due to Marton [Mar86] and
Talagrand [Tal96], allow us to bound the Wasserstein distance by the KL divergence (see, e.g.,
[RS13] for a review). For example, Talagrand's inequality states that if $Q = \mathcal{N}(0, \Sigma)$, then

$$W_2^2(P,Q) \le \frac{2\sigma_{\max}(\Sigma)}{\log e} D(P\|Q), \tag{5}$$

where $\sigma_{\max}(\Sigma)$ denotes the maximal singular value of $\Sigma$. Invoking (5) in conjunction with the
Wasserstein continuity of the differential entropy, we establish (2) and prove a new outer bound for
the capacity region of the two-user GIC, finally settling the missing corner point in [Cos85a]. See
Section 3 for details.
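Inequality (5) can be sanity-checked in one dimension, where both sides have closed forms. The sketch below assumes $P = \mathcal{N}(m, s^2)$ and $Q = \mathcal{N}(0, \sigma^2)$ on the real line, uses the standard closed forms for the Gaussian $W_2$ distance and KL divergence, and works in nats so that $\log e = 1$. The function names are illustrative only.

```python
import math


def w2_squared_gauss_1d(m: float, s: float, sigma: float) -> float:
    """W_2^2 between N(m, s^2) and N(0, sigma^2) on the real line."""
    return m * m + (s - sigma) ** 2


def kl_gauss_1d_nats(m: float, s: float, sigma: float) -> float:
    """D(N(m, s^2) || N(0, sigma^2)) in nats."""
    return math.log(sigma / s) + (s * s + m * m) / (2.0 * sigma * sigma) - 0.5


def talagrand_slack(m: float, s: float, sigma: float) -> float:
    """Right-hand side of (5) minus left-hand side; non-negative if (5) holds."""
    return 2.0 * sigma * sigma * kl_gauss_1d_nats(m, s, sigma) - w2_squared_gauss_1d(m, s, sigma)


if __name__ == "__main__":
    for m, s, sigma in [(0.0, 0.5, 1.0), (1.5, 2.0, 2.0), (-1.0, 3.0, 0.7)]:
        print(m, s, sigma, talagrand_slack(m, s, sigma))
```

For a pure shift ($s = \sigma$) the slack is zero, so (5) holds with equality in that case.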

One interesting by-product is an estimate that goes in the reverse direction of (5). Namely,
under regularity conditions on $P$ and $Q$ we have¹

$$D(P\|Q) \lesssim \sqrt{\int_{\mathbb{R}^n} \|x\|^2 (dP + dQ)} \cdot W_2(P,Q). \tag{6}$$

See Proposition 1 and Corollary 4 in the next section. We want to emphasize that there are a
number of estimates of the form $D(P_{X+Z}\|P_{\tilde{X}+Z}) \lesssim W_2^2(P_X, P_{\tilde{X}})$ where $X, \tilde{X}$ are independent of
a standard Gaussian vector $Z$, cf. [Vil03, Chapter 9, Remark 9.4]. The key difference between these
estimates and (6) is that in (6) the $W_2$ distance is measured after convolving with $P_Z$.

¹For positive $a, b$, denote $a \lesssim b$ if $a/b$ is at most some universal constant.

Notations. Throughout this paper log is with respect to an arbitrary base, which also specifies the
units of differential entropy $h(\cdot)$, Shannon entropy $H(\cdot)$, mutual information $I(\cdot;\cdot)$ and divergence
$D(\cdot\|\cdot)$. The natural logarithm is denoted by ln. The norm of $x \in \mathbb{R}^n$ is denoted by $\|x\| = (\sum_{j=1}^n x_j^2)^{1/2}$.
For random variables $X$ and $Y$, let $X \perp\!\!\!\perp Y$ denote their independence.


2 Wasserstein-continuity of information quantities 

We say that a probability density function $p$ on $\mathbb{R}^n$ is $(c_1, c_2)$-regular if $c_1 > 0$, $c_2 \ge 0$ and

$$\|\nabla \log p(x)\| \le c_1\|x\| + c_2, \quad \forall x \in \mathbb{R}^n.$$

Notice that, in particular, a regular density is never zero and furthermore

$$|\log p(x) - \log p(0)| \le \frac{c_1}{2}\|x\|^2 + c_2\|x\|.$$

Therefore, if $X$ has a regular density and finite second moment then

$$|h(X)| \le |\log p_X(0)| + c_2\,\mathbb{E}[\|X\|] + \frac{c_1}{2}\mathbb{E}[\|X\|^2] < \infty.$$

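Proposition 1. Let $U$ and $V$ be random vectors with finite second moments. If $V$ has a $(c_1, c_2)$-regular
density $p_V$, then there exists a coupling $P_{UV}$, such that

$$\mathbb{E}\left[\left|\log\frac{p_V(V)}{p_V(U)}\right|\right] \le \Delta, \tag{7}$$

where

$$\Delta = \left(\frac{c_1}{2}\sqrt{\mathbb{E}[\|U\|^2]} + \frac{c_1}{2}\sqrt{\mathbb{E}[\|V\|^2]} + c_2\right) W_2(P_U, P_V).$$

Consequently,

$$h(U) - h(V) \le \Delta. \tag{8}$$

If both $U$ and $V$ are $(c_1, c_2)$-regular, then

$$|h(U) - h(V)| \le \Delta, \tag{9}$$
$$D(P_U\|P_V) + D(P_V\|P_U) \le 2\Delta. \tag{10}$$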
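As an illustration of the one-sided bound (8), consider isotropic Gaussians, for which every quantity is available in closed form. The sketch below assumes $U \sim \mathcal{N}(0, s_U^2 I_n)$ and $V \sim \mathcal{N}(0, s_V^2 I_n)$, works in nats, and uses the standard fact that $W_2(P_U, P_V) = \sqrt{n}\,|s_U - s_V|$ for such a pair. The density of $V$ is $(1/s_V^2, 0)$-regular since $\nabla \ln p_V(x) = -x/s_V^2$. The function name is illustrative only.

```python
import math


def gaussian_prop1_check(n: int, s_u: float, s_v: float):
    """Return (h(U) - h(V), Delta) in nats for isotropic Gaussians.

    U ~ N(0, s_u^2 I_n), V ~ N(0, s_v^2 I_n); Delta is the quantity
    defined after (7) with c1 = 1 / s_v^2 and c2 = 0.
    """
    c1 = 1.0 / (s_v * s_v)
    c2 = 0.0
    w2 = math.sqrt(n) * abs(s_u - s_v)   # W_2 between the two Gaussians
    rms_u = math.sqrt(n) * s_u           # sqrt(E||U||^2)
    rms_v = math.sqrt(n) * s_v           # sqrt(E||V||^2)
    delta = (0.5 * c1 * rms_u + 0.5 * c1 * rms_v + c2) * w2
    gap = n * math.log(s_u / s_v)        # h(U) - h(V)
    return gap, delta


if __name__ == "__main__":
    print(gaussian_prop1_check(3, 2.0, 1.0))
```

In this special case $\Delta = n|s_U^2 - s_V^2|/(2 s_V^2)$, and (8) reduces to the elementary inequality $\ln r \le (r^2 - 1)/2$ with $r = s_U/s_V$.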


Proof. First notice:

$$|\log p_V(v) - \log p_V(u)| = \left|\int_0^1 dt\,\langle \nabla \log p_V(tv + (1-t)u),\, u - v\rangle\right| \tag{11}$$
$$\le \int_0^1 dt\,(c_2 + c_1 t\|v\| + c_1(1-t)\|u\|)\,\|u - v\| \tag{12}$$
$$= (c_2 + c_1\|v\|/2 + c_1\|u\|/2)\,\|u - v\|, \tag{13}$$

where (12) follows from the Cauchy-Schwarz inequality and the $(c_1, c_2)$-regularity of $p_V$. Taking the
expectation of (13) with respect to $(u, v)$ distributed according to the optimal $W_2$-coupling of $P_U$ and
$P_V$ and then applying Cauchy-Schwarz and the triangle inequality for the $L_2$-norm, we obtain (7).

To show (8) notice that by finiteness of the second moment $h(U) < \infty$. If $h(U) = -\infty$ then there
is nothing to prove. So assume otherwise; then in the identity

$$h(U) - h(V) + D(P_U\|P_V) = \mathbb{E}\left[\log\frac{p_V(V)}{p_V(U)}\right] \tag{14}$$

all terms are finite and hence (8) follows. Clearly, (8) implies (9) (when applied with $U$ and $V$
interchanged).

Finally, for (10) just add the identity (14) to itself with $U$ and $V$ interchanged to obtain

$$D(P_U\|P_V) + D(P_V\|P_U) = \mathbb{E}\left[\log\frac{p_V(V)}{p_V(U)}\right] + \mathbb{E}\left[\log\frac{p_U(U)}{p_U(V)}\right]$$

and estimate both terms via (7). □


The key question now is what densities are regular. It turns out that convolution with sufficiently 
smooth density, such as Gaussians, produces a regular density. 

Proposition 2. Let $V = B + Z$ where $B \perp\!\!\!\perp Z \sim \mathcal{N}(0, \sigma^2 I_n)$ and $\mathbb{E}[\|B\|] < \infty$. Then the density
of $V$ is $(c_1, c_2)$-regular with $c_1 = \frac{3\log e}{\sigma^2}$ and $c_2 = \frac{4\log e}{\sigma^2}\mathbb{E}[\|B\|]$.

Proof. First notice that whenever the density $p_Z$ of $Z$ is differentiable and non-vanishing, we have:

$$\nabla \log p_V(v) = \frac{\mathbb{E}[\nabla p_Z(v - B)]}{p_V(v)} = \mathbb{E}[\nabla \log p_Z(v - B) \mid V = v], \tag{15}$$

where $p_V(v) = \mathbb{E}[p_Z(v - B)]$ is the density of $V$. For $Z \sim \mathcal{N}(0, \sigma^2 I_n)$, we have

$$\nabla \log p_Z(v - B) = \frac{\log e}{\sigma^2}(B - v).$$

So the proof is completed by showing

$$\mathbb{E}[\|B - v\| \mid V = v] \le 3\|v\| + 4\,\mathbb{E}[\|B\|]. \tag{16}$$

For this, we mirror the proof in [WV12, Lemma 4]. Indeed, we have

$$\mathbb{E}[\|B - v\| \mid V = v] = \mathbb{E}\left[\|B - v\|\,\frac{p_Z(B - v)}{p_V(v)}\right] \tag{17}$$
$$\le 2\,\mathbb{E}[\|B - v\|\,1\{a(B, v) \le 2\}] + \mathbb{E}[\|B - v\|\,a(B, v)\,1\{a(B, v) > 2\}], \tag{18}$$

where we denoted

$$a(B, v) = \frac{p_Z(B - v)}{p_V(v)}.$$

Next, notice that

$$\{a(B, v) > 2\} = \left\{\|B - v\|^2 \le -2\sigma^2 \ln\left((2\pi\sigma^2)^{n/2}\, 2 p_V(v)\right)\right\}.$$

Thus since $\mathbb{E}[p_Z(B - v)] = p_V(v)$ we have an upper bound for the second term in (18) as follows:

$$\mathbb{E}[\|B - v\|\,a(B, v)\,1\{a(B, v) > 2\}] \le \sqrt{2}\,\sigma\sqrt{\ln^+ \frac{1}{(2\pi\sigma^2)^{n/2}\, 2 p_V(v)}}, \tag{19}$$

where $\ln^+ x = \max\{0, \ln x\}$. From Markov's inequality we have $\mathbb{P}[\|B\| \le 2\,\mathbb{E}[\|B\|]] \ge 1/2$ and
therefore

$$p_V(v) \ge \frac{1}{2(2\pi\sigma^2)^{n/2}}\, e^{-\frac{(\|v\| + 2\mathbb{E}[\|B\|])^2}{2\sigma^2}}.$$

Using this estimate in (19) we get

$$\mathbb{E}[\|B - v\|\,a(B, v)\,1\{a(B, v) > 2\}] \le \|v\| + 2\,\mathbb{E}[\|B\|]. \tag{20}$$

Upper-bounding the first term in (18) by $2\,\mathbb{E}[\|B\|] + 2\|v\|$ we finish the proof of (16). □
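The score identity (15) and the regularity bound of Proposition 2 can be seen explicitly in a one-dimensional toy case. The sketch below assumes $B$ uniform on $\{-1, +1\}$ and $Z \sim \mathcal{N}(0, \sigma^2)$, so that $\mathbb{E}[B \mid V = v] = \tanh(v/\sigma^2)$ and $\mathbb{E}[|B|] = 1$, and uses natural logarithms. This example and its function names are illustrative and are not taken from the paper.

```python
import math


def score_two_point(v: float, sigma: float) -> float:
    """d/dv ln p_V(v) for V = B + Z, B uniform on {-1, +1}, Z ~ N(0, sigma^2).

    By (15), the score equals (E[B | V = v] - v) / sigma^2.
    """
    b_hat = math.tanh(v / sigma**2)  # conditional mean of B given V = v
    return (b_hat - v) / sigma**2


def log_density_two_point(v: float, sigma: float) -> float:
    """ln p_V(v) for the same two-point Gaussian mixture."""
    def phi(z: float) -> float:
        return math.exp(-z * z / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
    return math.log(0.5 * phi(v - 1.0) + 0.5 * phi(v + 1.0))


def regularity_bound(v: float, sigma: float, mean_abs_b: float = 1.0) -> float:
    """c1 |v| + c2 with c1 = 3 / sigma^2 and c2 = 4 E|B| / sigma^2 (nats)."""
    return (3.0 * abs(v) + 4.0 * mean_abs_b) / sigma**2


if __name__ == "__main__":
    for v in (-3.0, 0.0, 0.5, 4.0):
        print(v, score_two_point(v, 1.0), regularity_bound(v, 1.0))
```

In this bounded-input example the constants of Proposition 2 are loose, since $|\tanh(v/\sigma^2) - v| \le |v| + 1$; this is in line with the later observation that bounded inputs allow sharper estimates.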

Another useful criterion for regularity is the following: 

Proposition 3. If $W$ has a $(c_1, c_2)$-regular density and $B \perp\!\!\!\perp W$ satisfies

$$\|B\| \le \sqrt{nP} \quad \text{a.s.} \tag{21}$$

then $V = B + W$ has a $(c_1, c_2 + c_1\sqrt{nP})$-regular density.

Proof. Apply (15) and the estimate:

$$\mathbb{E}[\|\nabla \log p_W(v - B)\| \mid V = v] \le c_1(\|v\| + \sqrt{nP}) + c_2. \qquad \square$$

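As a consequence of regularity, we show that when smoothed by Gaussian noise, mutual
information, differential entropy and divergence are Lipschitz with respect to the $W_2$-distance under
average power constraints:

Corollary 4. Assume that $X, \tilde{X} \perp\!\!\!\perp Z$, with $\mathbb{E}[\|X\|^2], \mathbb{E}[\|\tilde{X}\|^2] \le nP$ and $Z \sim \mathcal{N}(0, \sigma^2 I_n)$. Then

$$|I(X; X + Z) - I(\tilde{X}; \tilde{X} + Z)| = |h(X + Z) - h(\tilde{X} + Z)| \le \Delta, \tag{22}$$
$$D(P_{X+Z}\|P_{\tilde{X}+Z}) + D(P_{\tilde{X}+Z}\|P_{X+Z}) \le 2\Delta, \tag{23}$$

where $\Delta = \frac{\log e}{\sigma^2}\left(3\sqrt{n(\sigma^2 + P)} + 4\sqrt{nP}\right) W_2(P_{X+Z}, P_{\tilde{X}+Z})$.

Proof. Since $\mathbb{E}[\|X\|] \le \sqrt{nP}$ and $\mathbb{E}[\|\tilde{X}\|] \le \sqrt{nP}$, by Proposition 2, the densities of $X + Z$ and $\tilde{X} + Z$ are both
$\left(\frac{3\log e}{\sigma^2}, \frac{4\sqrt{nP}\log e}{\sigma^2}\right)$-regular. The desired statement then follows from applying (9)-(10) to $U = X + Z$
and $V = \tilde{X} + Z$. □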

Remark 1. The Lipschitz constant $\sqrt{n}$ is order-optimal as the example of Gaussian $X$ and $\tilde{X}$
with different variances (one of them could be zero) demonstrates. The linear dependence on $W_2$
is also optimal. To see this, consider $X \sim \mathcal{N}(0, 1)$ and $\tilde{X} \sim \mathcal{N}(0, 1 + t)$ in one dimension, with $\sigma^2 = 1$. Then
$|h(X + Z) - h(\tilde{X} + Z)| = \frac{1}{2}\log(1 + t/2) = \Theta(t)$ and $W_2^2(X + Z, \tilde{X} + Z) = (\sqrt{2 + t} - \sqrt{2})^2 = \Theta(t^2)$,
as $t \to 0$.
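The scaling in Remark 1 is easy to tabulate. The sketch below is a minimal check in nats with $n = 1$ and $\sigma^2 = 1$; it evaluates the entropy gap, the $W_2$ distance, and the bound $\Delta$ of Corollary 4 with $P = 1 + t$. The function names are illustrative only.

```python
import math


def remark1_quantities(t: float):
    """Return (|h(X+Z) - h(X_tilde+Z)|, W_2(X+Z, X_tilde+Z)) in nats.

    X ~ N(0, 1), X_tilde ~ N(0, 1 + t), Z ~ N(0, 1), all one-dimensional.
    """
    gap = 0.5 * math.log(1.0 + t / 2.0)
    w2 = math.sqrt(2.0 + t) - math.sqrt(2.0)
    return gap, w2


def corollary4_delta(t: float) -> float:
    """Delta from (22) with n = 1, sigma^2 = 1 and power level P = 1 + t."""
    p = 1.0 + t
    _, w2 = remark1_quantities(t)
    return (3.0 * math.sqrt(1.0 + p) + 4.0 * math.sqrt(p)) * w2


if __name__ == "__main__":
    for t in (1e-6, 1e-3, 0.1, 1.0, 10.0):
        gap, w2 = remark1_quantities(t)
        print(t, gap / w2, corollary4_delta(t))
```

As $t \to 0$ the ratio of the entropy gap to $W_2$ tends to $1/\sqrt{2}$, a nonzero constant, which is the sense in which the linear dependence on $W_2$ cannot be improved.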


In fact, to get the best constants for applications to interference channels it is best to forgo the
notion of regular density and deal directly with (15). Indeed, when the inputs have bounded norms,
the next result gives a sharpened version of what can be obtained by combining Proposition 1 with
Proposition 2.

Proposition 5. Let $B$ satisfying (21) and $G \sim \mathcal{N}(0, \sigma_G^2 I_n)$ be independent. Let $V = B + G$. Then
for any $U$,

$$h(U) - h(V) \le \frac{\log e}{2\sigma_G^2}\left(\mathbb{E}[\|U\|^2] - \mathbb{E}[\|V\|^2] + 2\sqrt{nP}\,W_1(P_U, P_V)\right). \tag{24}$$
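Proof. Plugging the Gaussian density $p_G(z) = (2\pi\sigma_G^2)^{-n/2} e^{-\|z\|^2/(2\sigma_G^2)}$ into (15) we get

$$\nabla \log p_V(v) = \frac{\log e}{\sigma_G^2}(\hat{B}(v) - v), \tag{25}$$

where $\hat{B}(v) = \mathbb{E}[B \mid V = v] = \frac{\mathbb{E}[B\,p_G(v - B)]}{p_V(v)}$ satisfies

$$\|\hat{B}(v)\| \le \sqrt{nP},$$

since $\|B\| \le \sqrt{nP}$ almost surely. Next we use

$$\log\frac{p_V(v)}{p_V(u)} = \int_0^1 dt\,\langle \nabla \log p_V(tv + (1-t)u),\, v - u\rangle \tag{26}$$
$$= \frac{\log e}{\sigma_G^2}\int_0^1 dt\,\langle \hat{B}(tv + (1-t)u),\, v - u\rangle - \frac{\log e}{2\sigma_G^2}(\|v\|^2 - \|u\|^2) \tag{27}$$
$$\le \frac{\sqrt{nP}\log e}{\sigma_G^2}\|v - u\| - \frac{\log e}{2\sigma_G^2}(\|v\|^2 - \|u\|^2). \tag{28}$$

Taking the expectation of the last equation under the $W_1$-optimal coupling and in view of (14), we
obtain (24). □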


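To get slightly better constants in the one-sided version of (22) we apply Proposition 5:

Corollary 6. Let $A, B, G, Z$ be independent, with $G \sim \mathcal{N}(0, \sigma_G^2 I_n)$, $Z \sim \mathcal{N}(0, \sigma_Z^2 I_n)$ and $B$
satisfying (21). Then for every $c \in [0, 1]$ we have:

$$h(B + A + Z) - h(B + G + Z) \le \frac{\log e}{2(\sigma_G^2 + \sigma_Z^2)}\left(\mathbb{E}[\|A\|^2] - \mathbb{E}[\|G\|^2]\right) + \frac{\sqrt{2nP(\sigma_G^2 + c^2\sigma_Z^2)\log e}}{\sigma_G^2 + \sigma_Z^2}\sqrt{D(P_{A+cZ}\|P_{G+cZ})}. \tag{29}$$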


Proof. First, notice that by definition the Wasserstein distance is non-increasing under convolutions,
i.e., $W_2(P_1 * Q, P_2 * Q) \le W_2(P_1, P_2)$. Since $c \le 1$ and the Gaussian distribution is stable, we have

$$W_2(P_{B+A+Z}, P_{B+G+Z}) \le W_2(P_{A+Z}, P_{G+Z}) \le W_2(P_{A+cZ}, P_{G+cZ}),$$

which, in turn, can be bounded via Talagrand's inequality (5) by

$$W_2(P_{A+cZ}, P_{G+cZ}) \le \sqrt{\frac{2(\sigma_G^2 + c^2\sigma_Z^2)}{\log e} D(P_{A+cZ}\|P_{G+cZ})}.$$

From here we apply Proposition 5 with $G$ replaced by $G + Z$ (and $\sigma_G^2$ by $\sigma_G^2 + \sigma_Z^2$). □


3 Applications to Gaussian interference channels 

3.1 New outer bound 

Consider the two-user Gaussian interference channel (GIC):

$$Y_1 = X_1 + bX_2 + Z_1, \qquad Y_2 = aX_1 + X_2 + Z_2, \tag{30}$$

with $a, b \ge 0$, $Z_i \sim \mathcal{N}(0, I_n)$ and a power constraint on the $n$-letter codebooks: either

$$\|X_1\| \le \sqrt{nP_1}, \quad \|X_2\| \le \sqrt{nP_2} \quad \text{a.s.} \tag{31}$$

or

$$\mathbb{E}[\|X_1\|^2] \le nP_1, \quad \mathbb{E}[\|X_2\|^2] \le nP_2. \tag{32}$$

Denote by $\mathcal{R}(a, b)$ the capacity region of the GIC (30). As an application of the results developed
in Section 2, we prove an outer bound for the capacity region.


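Theorem 7. Let $0 < a \le 1$. Let $C_2 = \frac{1}{2}\log(1 + P_2)$ and $\tilde{C}_2 = \frac{1}{2}\log\left(1 + \frac{P_2}{1 + a^2P_1}\right)$. Assume the almost
sure power constraint (31). Then for any $b \ge 0$ and $\tilde{C}_2 \le R_2 \le C_2$, any rate pair $(R_1, R_2) \in \mathcal{R}(a, b)$
satisfies

$$R_1 \le \frac{1}{2}\log\min\left\{A - \frac{1}{a^2} + 1,\; A\,\frac{(1 + P_2)\left(1 - (1 - a^2)\exp(-2\delta)\right) - a^2}{P_2}\right\}, \tag{33}$$

where

$$A = \left(P_1 + a^{-2}(1 + P_2)\right)\exp(-2R_2), \tag{34}$$
$$\delta = C_2 - R_2 + a\sqrt{\frac{2P_1(C_2 - R_2)\log e}{1 + P_2}}. \tag{35}$$

Assume the average power constraint (32). Then (33) holds with $\delta$ replaced by

$$\delta' = C_2 - R_2 + \sqrt{\frac{2(C_2 - R_2)\log e}{1 + P_2}}\left(3\sqrt{1 + a^2P_1 + P_2} + 4a\sqrt{P_1}\right). \tag{36}$$

Consequently, in both cases, $R_2 \ge C_2 - \epsilon$ implies that $R_1 \le \frac{1}{2}\log\left(1 + \frac{a^2P_1}{1 + P_2}\right) + \epsilon'$ where $\epsilon' = O(\sqrt{\epsilon})$
as $\epsilon \to 0$.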
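The bound (33) under the almost sure power constraint (31) is straightforward to evaluate numerically. The sketch below is one possible implementation in nats (so $\log e = 1$ and $\exp$ is the ordinary exponential); the function name `outer_bound_r1` and the argument checks are assumptions made for illustration.

```python
import math


def outer_bound_r1(r2: float, p1: float, p2: float, a: float) -> float:
    """Right-hand side of (33) in nats, with A from (34) and delta from (35)."""
    if not 0.0 < a <= 1.0:
        raise ValueError("Theorem 7 assumes 0 < a <= 1")
    c2 = 0.5 * math.log(1.0 + p2)
    c2_tilde = 0.5 * math.log(1.0 + p2 / (1.0 + a * a * p1))
    if not c2_tilde - 1e-12 <= r2 <= c2 + 1e-12:
        raise ValueError("R2 must lie between C2_tilde and C2")
    gap = max(c2 - r2, 0.0)                                   # C2 - R2
    big_a = (p1 + (1.0 + p2) / (a * a)) * math.exp(-2.0 * r2)  # (34)
    delta = gap + a * math.sqrt(2.0 * p1 * gap / (1.0 + p2))   # (35)
    first = big_a - 1.0 / (a * a) + 1.0
    second = big_a * ((1.0 + p2) * (1.0 - (1.0 - a * a) * math.exp(-2.0 * delta)) - a * a) / p2
    return 0.5 * math.log(min(first, second))


if __name__ == "__main__":
    p1, p2, a = 1.0, 1.0, 0.8
    c2 = 0.5 * math.log(1.0 + p2)
    c2_tilde = 0.5 * math.log(1.0 + p2 / (1.0 + a * a * p1))
    for k in range(6):
        r2 = c2_tilde + (c2 - c2_tilde) * k / 5
        print(round(r2, 4), round(outer_bound_r1(r2, p1, p2, a), 4))
```

At $R_2 = C_2$ the second term in the minimum is active and the bound equals $\frac{1}{2}\ln\left(1 + \frac{a^2P_1}{1+P_2}\right)$; at $R_2 = \tilde{C}_2$ the first term equals $1 + P_1$, so the bound is at most $\frac{1}{2}\ln(1 + P_1)$.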


Proof. Without loss of generality, assume that all random variables have zero mean. First of all,
setting $b = 0$ (which is equivalent to granting the first user access to $X_2$) will not shrink the capacity
region of the interference channel (30). Therefore to prove the desired outer bound it suffices to
focus on the following Z-interference channel henceforth:

$$Y_1 = X_1 + Z_1, \qquad Y_2 = aX_1 + X_2 + Z_2. \tag{37}$$

Let $(X_1, X_2)$ be $n$-dimensional random variables corresponding to the encoder output of the first
and second user, which are uniformly distributed on the respective codebook. For $i = 1, 2$ define

$$R_i \triangleq \frac{1}{n} I(X_i; Y_i).$$

By Fano's inequality there is no difference asymptotically between this definition of rate and the
operational one. Define the entropy-power function of the $X_1$-codebook:

$$N_1(t) \triangleq \exp\left\{\frac{2}{n} h(X_1 + \sqrt{t}Z)\right\}, \quad Z \sim \mathcal{N}(0, I_n).$$

We know the following general properties of $N_1(t)$:

• $N_1$ is monotonically increasing.

• $N_1(0) = 0$ (since $X_1$ is uniform over the codebook).

• $N_1'(t) \ge 2\pi e$ (since $N_1(t + \delta) \ge N_1(t) + 2\pi e\delta$ by the entropy power inequality).

• $N_1(t)$ is concave (Costa's entropy power inequality [Cos85a]).

• $N_1(t) \le 2\pi e(P_1 + t)$ (Gaussian maximizes differential entropy).

We can then express $R_1$ in terms of the entropy power function as

$$R_1 = \frac{1}{2}\log\frac{N_1(1)}{2\pi e}. \tag{38}$$

It remains to upper bound $N_1(1)$. Note that

$$nR_2 = I(X_2; Y_2) = h(X_2 + aX_1 + Z_2) - h(aX_1 + Z_2) \le \frac{n}{2}\log 2\pi e(1 + P_2 + a^2P_1) - h(aX_1 + Z_2)$$

and therefore

$$N_1\left(\frac{1}{a^2}\right) \le 2\pi e A, \tag{39}$$

where $A$ is defined in (34). This in conjunction with the slope property $N_1'(t) \ge 2\pi e$ yields

$$N_1(1) \le N_1\left(\frac{1}{a^2}\right) - 2\pi e(a^{-2} - 1) \le 2\pi e(A - a^{-2} + 1), \tag{40}$$

which, in view of (38), yields the first part of the bound (33).

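To obtain the second bound, let $G_2 \sim \mathcal{N}(0, P_2 I_n)$. Using $\mathbb{E}[\|X_2\|^2] \le nP_2$ and $X_1 \perp\!\!\!\perp X_2$, we
obtain

$$nR_2 = I(X_2; Y_2) \le I(X_2; Y_2 \mid X_1) = I(X_2; X_2 + Z_2)$$
$$= nC_2 - h(G_2 + Z_2) + h(X_2 + Z_2) \le nC_2 - D(P_{X_2+Z_2}\|P_{G_2+Z_2}),$$

that is,

$$D(P_{X_2+Z_2}\|P_{G_2+Z_2}) \le h(G_2 + Z_2) - h(X_2 + Z_2) \le n(C_2 - R_2). \tag{41}$$

Furthermore,

$$nR_2 = I(X_2; Y_2) = h(aX_1 + X_2 + Z_2) - h(aX_1 + G_2 + Z_2) \tag{42}$$
$$\qquad + h(aX_1 + G_2 + Z_2) - h(aX_1 + Z_2). \tag{43}$$

Note that the second term (43) is precisely $\frac{n}{2}\log\frac{N_1\left(\frac{1+P_2}{a^2}\right)}{N_1\left(\frac{1}{a^2}\right)}$. The first term (42) can be bounded by
applying Corollary 6 and (41) with $B = aX_1$, $A = X_2$, $G = G_2$ and $c = 1$:

$$h(aX_1 + X_2 + Z_2) - h(aX_1 + G_2 + Z_2) \le n\sqrt{\frac{2a^2P_1(C_2 - R_2)\log e}{1 + P_2}}. \tag{44}$$

Combining (42)-(44) yields

$$N_1\left(\frac{1}{a^2}\right) \le \frac{\exp(2\delta)}{1 + P_2}\, N_1\left(\frac{1 + P_2}{a^2}\right), \tag{45}$$

where $\delta$ is defined in (35). From the concavity of $N_1(t)$ and (45)

$$N_1(1) \le \gamma N_1\left(\frac{1}{a^2}\right) - (\gamma - 1) N_1\left(\frac{1 + P_2}{a^2}\right) \tag{46}$$
$$\le \left(\gamma - (\gamma - 1)(1 + P_2)\exp(-2\delta)\right) N_1\left(\frac{1}{a^2}\right), \tag{47}$$

where $\gamma = 1 + \frac{1 - a^2}{P_2} \ge 1$. In view of (38), upper bounding $N_1(1/a^2)$ in (47) via (39) we get after
some simplifications the second part of (33).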

The outer bound for the average power constraint (32) follows analogously with (44) replaced by
(48) below: By Proposition 2, the density of $aX_1 + G_2 + Z_2$ is $\left(\frac{3\log e}{1 + P_2}, \frac{4a\sqrt{nP_1}\log e}{1 + P_2}\right)$-regular. Applying
Proposition 1 to (44), we have $h(aX_1 + X_2 + Z_2) - h(aX_1 + G_2 + Z_2) \le \Delta$, where

$$\Delta = \left(3\sqrt{1 + a^2P_1 + P_2} + 4a\sqrt{P_1}\right)\frac{\log e}{1 + P_2}\sqrt{n}\,W_2(P_{aX_1+X_2+Z_2}, P_{aX_1+G_2+Z_2}).$$

Again using the fact that the $W_2$ distance is non-increasing under convolutions and invoking
Talagrand's inequality, we have

$$W_2(P_{aX_1+X_2+Z_2}, P_{aX_1+G_2+Z_2}) \le W_2(P_{X_2+Z_2}, P_{G_2+Z_2}) \le \sqrt{\frac{2(1 + P_2)}{\log e} D(P_{X_2+Z_2}\|P_{G_2+Z_2})},$$

which yields

$$h(aX_1 + X_2 + Z_2) - h(aX_1 + G_2 + Z_2) \le n\sqrt{\frac{2(C_2 - R_2)\log e}{1 + P_2}}\left(3\sqrt{1 + a^2P_1 + P_2} + 4a\sqrt{P_1}\right). \tag{48}$$

This yields the outer bound with $\delta'$ defined in (36).

Finally, in both cases, when $R_2 \to C_2$, we have $\delta \to 0$ and $A \to \frac{P_1 + a^{-2}(1 + P_2)}{1 + P_2}$, and hence from (33)
$R_1 \le C_1'$. □


Remark 2. The first part of the bound (33) coincides with Sato's outer bound [Sat78] and [Kra04,
Theorem 2] by Kramer; the latter was obtained by reducing the Z-interference
channel to the degraded broadcast channel. The second part of (33) is new and settles the missing
corner point of the capacity region (see Section 3.2 for discussions). Note that our estimates on
$N_1(1)$ in the proof of Theorem 7 are tight in the sense that there exists a concave function $N_1(t)$
satisfying the listed general properties, estimates (45) and (39) as well as attaining the minimum
of (40) and (47) at $N_1(1)$. Hence, tightening the bound via this method would require inferring
more information about $N_1(t)$.


Remark 3. The outer bound (33) relies on Costa's EPI. To establish the second statement about the
corner point, it is sufficient to invoke the concavity of $\gamma \mapsto I(X_2; \sqrt{\gamma}X_2 + Z_2)$ [GSSV05, Corollary
1], which is strictly weaker than Costa's EPI.


The outer bound (33) is evaluated in Fig. 1 for the case of $b = 0$ (Z-interference), where we
also plot (just for reference) the simple Han-Kobayashi inner bound for the Z-GIC (37) attained
by choosing $X_1 = U + V$ with $U \perp\!\!\!\perp V$ jointly Gaussian. This achieves rates:

$$\begin{cases} R_1 = \frac{1}{2}\log(1 + P_1 - s) + \frac{1}{2}\log\left(1 + \frac{a^2 s}{1 + a^2(P_1 - s) + P_2}\right) \\ R_2 = \frac{1}{2}\log\left(1 + \frac{P_2}{1 + a^2(P_1 - s)}\right) \end{cases} \qquad 0 \le s \le P_1. \tag{49}$$

For more sophisticated Han-Kobayashi bounds see [Sas04, Cos11].
[Figure 1 plot: Z-interference channel, $P_1 = 1$, $P_2 = 1$, $a = 0.8$]

Figure 1: Illustration of the "missing corner point": The bound in Theorem 7 establishes the
location of the upper corner point, as conjectured by Costa [Cos85b]. The bottom corner point has
been established by Sato [Sat78].

3.2 Corner points of the capacity region 

The two corner points of the capacity region are defined as follows:

$$C_1'(a, b) \triangleq \max\{R_1 : (R_1, C_2) \in \mathcal{R}(a, b)\}, \tag{50}$$
$$C_2'(a, b) \triangleq \max\{R_2 : (C_1, R_2) \in \mathcal{R}(a, b)\}, \tag{51}$$

where $C_i = \frac{1}{2}\log(1 + P_i)$. As a corollary, Theorem 7 completes the picture of the corner points for
the capacity region of GIC for all values of $a, b \in \mathbb{R}_+$ under the average power constraint (32). We
note that the new result here is the proof of $C_1'(a, b) = \frac{1}{2}\log\left(1 + \frac{a^2P_1}{1 + P_2}\right)$ for $0 < a \le 1$ and $b \ge 0$.
The interpretation is that if one user desires to achieve its own interference-free capacity, then the
other user must guarantee that its message is decodable at both receivers. The achievability of this
corner point was previously known, while the converse was previously considered by Costa [Cos85b]
but with a flawed proof, as pointed out in [Sas04]. The high-level difference between our proof and
that of [Cos85b] is the replacement of Pinsker's inequality by Talagrand's and the use of a coupling
argument.²

²After circulating our initial draft, we were informed that the authors of [BPS14] posted an updated
manuscript [BPS15a] that also proves Costa's conjecture. Their method is based on the analysis of the minimum
mean-square error (MMSE) properties of good channel codes, but we were not able to verify all the details. A further
update is in [BPS15b].

Below we present a brief account of the corner points in various cases; for an extensive discussion
see [Sas15]. We start with a few simple observations about the capacity region $\mathcal{R}(a, b)$:

• Any rate pair satisfying the following belongs to $\mathcal{R}(a, b)$:

$$R_1 \le \frac{1}{2}\log(1 + P_1\min(1, a^2))$$
$$R_2 \le \frac{1}{2}\log(1 + P_2\min(1, b^2)) \tag{52}$$
$$R_1 + R_2 \le \frac{1}{2}\log(1 + \min(P_1 + b^2P_2, P_2 + a^2P_1)),$$

which corresponds to the intersection of two Gaussian multiple-access (MAC) capacity regions,
namely, $(X_1, X_2) \to Y_1$ and $(X_1, X_2) \to Y_2$. These rate pairs correspond to the case when
each receiver decodes both messages.

• For $a \ge 1$ and $b \ge 1$ (strong interference) the capacity region is known to coincide with (52)
[Car75, Sat81]. So, without loss of generality we assume $a \le 1$ henceforth.

• Replacing either $a$ or $b$ with zero can only enlarge the region (genie argument).

• If $b \ge 1$ then for any $(R_1, R_2) \in \mathcal{R}(a, b)$ we have [Sat81]

$$R_1 + R_2 \le \frac{1}{2}\log(1 + b^2P_2 + P_1). \tag{53}$$

This follows from the observation that in this case $I(X_1, X_2; Y_1) = H(X_1, X_2) - o(n)$, since
conditioned on $X_1$, $Y_2$ is a noisier observation of $X_2$ than $Y_1$.


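For the top corner, we have the following:

$$C_1'(a, b) = \begin{cases} \frac{1}{2}\log\left(1 + \frac{a^2P_1}{1 + P_2}\right), & 0 < a \le 1,\ b \ge 0 \\ C_1, & a = 0,\ b = 0 \text{ or } b \ge \sqrt{1 + P_1} \\ \frac{1}{2}\log\left(1 + \frac{P_1 + (b^2 - 1)P_2}{1 + P_2}\right), & a = 0,\ 1 < b < \sqrt{1 + P_1} \\ \frac{1}{2}\log\left(1 + \frac{P_1}{1 + b^2P_2}\right), & a = 0,\ 0 < b \le 1. \end{cases} \tag{54}$$

Note that for any $b \ge 0$, $a \mapsto C_1'(a, b)$ is discontinuous as $a \downarrow 0$. To verify (54) we consider each
case separately: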


1. For $a > 0$ the converse bound follows from Theorem 7. For achievability, we consider two
cases. When $b \le 1$, we have $\frac{a^2P_1}{1 + P_2} \le \frac{P_1}{1 + b^2P_2}$, therefore treating interference $X_2$ as noise at
the first receiver and using a Gaussian MAC-code for $(X_1, X_2) \to Y_2$ works. For $b > 1$, the
achievability follows from the MAC inner bound (52). Note that since $\frac{1}{2}\log(1 + P_1 + b^2P_2) \ge
\frac{1}{2}\log(1 + P_2 + a^2P_1)$, a Gaussian MAC-code that works for $(X_1, X_2) \to Y_2$ will also work for
$(X_1, X_2) \to Y_1$. Alternatively, the achievability also follows from the Han-Kobayashi inner bound
(see, e.g., [EGK11, Theorem 6.4] with $(U_1, U_2) = (X_1, X_2)$ for $b \ge 1$ and $(U_1, U_2) = (X_1, 0)$
for $b \le 1$).

2. For $a = 0$ and $b \ge \sqrt{1 + P_1}$ the converse is obvious, while for achievability we have that
$\frac{b^2P_2}{1 + P_1} \ge P_2$ and therefore $X_2$ is decodable at $Y_1$.

3. For $a = 0$ and $1 < b < \sqrt{1 + P_1}$ the converse is (53) and the achievability is just the MAC
code $(X_1, X_2) \to Y_1$ with rate $R_2 = C_2$.

4. For $a = 0$ and $0 < b \le 1$ the result follows from the treatment of $C_2'(a, b)$ below by
interchanging $a \leftrightarrow b$ and $P_1 \leftrightarrow P_2$.


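The bottom corner point is given by the following:

$$C_2'(a, b) = \begin{cases} \frac{1}{2}\log\left(1 + \frac{P_2}{1 + a^2P_1}\right), & 0 \le a \le 1,\ b = 0 \text{ or } b \ge \sqrt{\frac{1 + P_1}{1 + a^2P_1}} \\ \frac{1}{2}\log\left(1 + \frac{b^2P_2}{1 + P_1}\right), & 0 \le a \le 1,\ 1 < b < \sqrt{\frac{1 + P_1}{1 + a^2P_1}} \\ \frac{1}{2}\log\left(1 + \frac{b^2P_2}{1 + P_1}\right), & 0 \le a \le 1,\ 0 < b \le 1 \end{cases} \tag{55}$$

which is discontinuous as $b \downarrow 0$ for any fixed $a \in [0, 1]$. We treat each case separately: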
1. The case of $C_2'(a, 0)$ is due to Sato [Sat78] (see also [Kra04, Theorem 2]). The converse part
also follows from Theorem 7 (for $a = 0$ there is nothing to prove). For the achievability,
we notice that under $b \ge \sqrt{\frac{1 + P_1}{1 + a^2P_1}}$ we have $\frac{b^2P_2}{1 + P_1} \ge \frac{P_2}{1 + a^2P_1}$ and thus $X_2$ at rate $C_2'(a, 0)$
can be decoded and canceled from $Y_1$ by simply treating $X_1$ as Gaussian noise (as usual, we
assume Gaussian random codebooks). Thus the problem reduces to that of $b = 0$. For $b = 0$,
the Gaussian random coding achieves the claimed result if the second receiver treats $X_1$ as
Gaussian noise.

2. The converse follows from (53) and for the achievability we use the Gaussian MAC-code
$(X_1, X_2) \to Y_1$ and treat $X_1$ as Gaussian interference at $Y_2$.

3. If $b \in (0, 1]$, we apply results on $C_1'(a, b)$ in (54) by interchanging $a \leftrightarrow b$ and $P_1 \leftrightarrow P_2$.

4 Discrete version 

4.1 Bounding entropy and information via Ornstein’s distance 

Fix a finite alphabet $\mathcal{X}$ and an integer $n$. On the product space $\mathcal{X}^n$ we define the Hamming distance

$$d_H(x, y) = \sum_{j=1}^n 1\{x_j \ne y_j\},$$

and consider the corresponding Wasserstein distance $W_1$. In fact, $\frac{1}{n}W_1(P, Q)$ is known as Ornstein's
$\bar{d}$-distance [GNS75, Mar86], namely,

$$\bar{d}(P, Q) = \frac{1}{n}\inf \mathbb{E}[d_H(X, Y)], \tag{56}$$

where the infimum is taken over all couplings $P_{XY}$ of $P$ and $Q$. For $n = 1$, this coincides with the
total variation, which is also expressible as $d_{TV}(P, Q) = \frac{1}{2}\int |dP - dQ|$ for $P, Q$ on $\mathcal{X}$.

For a pair of distributions $P, Q$ on $\mathcal{X}^n$ we may ask the following questions:

1. Does $D(P\|Q)$ control the entropy difference $H(P) - H(Q)$?

2. Does $\bar{d}(P, Q)$ control the entropy difference $H(P) - H(Q)$?

Recall that in the Euclidean space the answer to both questions was negative unless the distributions
satisfy certain regularity conditions. For discrete alphabets the answer to the first question is still
negative in general (see Section 1 for a counterexample); nevertheless, the answer to the second
one turns out to be positive:

Proposition 8. Let $P$ and $Q$ be distributions on $\mathcal{X}^n$ and let

$$F_{\mathcal{X}}(x) \triangleq x\log(|\mathcal{X}| - 1) + x\log\frac{1}{x} + (1 - x)\log\frac{1}{1 - x}, \quad 0 \le x \le 1.$$

Then

$$|H(P) - H(Q)| \le nF_{\mathcal{X}}(\bar{d}(P, Q)). \tag{57}$$
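For $n = 1$ the $\bar{d}$-distance coincides with the total variation, so (57) can be checked directly on small distributions. The sketch below does this in nats; the helper names are illustrative and the distributions are arbitrary examples, not taken from the paper.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats of a probability vector p."""
    return -sum(q * math.log(q) for q in p if q > 0)


def total_variation(p, q):
    """Total variation distance, equal to the d-bar distance when n = 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def fano_type_bound(x: float, alphabet_size: int) -> float:
    """F_X(x) from Proposition 8 in nats, with the convention 0 log(1/0) = 0."""
    def x_log_inv(u: float) -> float:
        return -u * math.log(u) if u > 0 else 0.0
    return x * math.log(alphabet_size - 1) + x_log_inv(x) + x_log_inv(1.0 - x)


if __name__ == "__main__":
    p = [1 / 3, 1 / 3, 1 / 3]
    q = [0.8, 0.1, 0.1]
    gap = abs(shannon_entropy(p) - shannon_entropy(q))
    print(gap, fano_type_bound(total_variation(p, q), len(p)))
```

For a binary alphabet $F_{\mathcal{X}}$ reduces to the binary entropy function, and the pair $P = (1, 0)$, $Q = (0.1, 0.9)$ meets the bound with equality.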
Proof. In fact, the statement holds for any translation-invariant distance $d(\cdot, \cdot)$ on $\mathcal{X}$ extended
additively to $\mathcal{X}^n$, i.e., $d(x, x') = \sum_{j=1}^n d(x_j, x_j')$ for any $x, x' \in \mathcal{X}^n$. Indeed, define

$$f_n(s) \triangleq \max_{P_X}\left\{\frac{1}{n}H(X) : \mathbb{E}[d(X, x_0)] \le ns\right\},$$

where $x_0 \in \mathcal{X}^n$ is an arbitrary fixed string. It is easy to see that $s \mapsto f_n(s)$ is concave since
$P \mapsto H(P)$ is. Furthermore, writing $X = (X_1, \ldots, X_n)$ and applying the chain rule for entropy we get

$$f_n(s) = f_1(s).$$

Thus, letting $X, Y$ be distributed according to the $\bar{d}$-optimal coupling of $P$ and $Q$, we get

$$H(X) - H(Y) \le H(X, Y) - H(Y) = H(X \mid Y) \tag{58}$$
$$\le n\,\mathbb{E}\left[f_n\left(\tfrac{1}{n}\mathbb{E}[d(X, Y) \mid Y]\right)\right] \tag{59}$$
$$\le n f_n(\bar{d}(P, Q)), \tag{60}$$

where (59) is by the definition of $f_n(\cdot)$ and (60) is by Jensen's inequality. Finally, for the Hamming
distance we have $f_1(s) = F_{\mathcal{X}}(s)$ by Fano's inequality. □


Notice that the right-hand side of (57) behaves like $n\bar{d}\log\frac{1}{\bar{d}}$ when $\bar{d}(P, Q)$ is small. This
super-linear dependence is in fact sharp.³ Nevertheless, if certain regularity of distributions is assumed,
the estimate (57) can be improved to be linear in $\bar{d}(P, Q)$. The next result is the analog of
Proposition 1 in the discrete space. We formulate it in a form convenient for applications in multi-user
information theory.

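Proposition 9. Let $P_{Y|X,A}$ be a two-input blocklength-$n$ memoryless channel, namely

$$P_{Y|X,A}(y|x, a) = \prod_{j=1}^n W(y_j|x_j, a_j),$$

where $W(\cdot|\cdot)$ is a stochastic matrix and $y \in \mathcal{Y}^n$, $x \in \mathcal{X}^n$, $a \in \mathcal{A}^n$. Let $X, A, \tilde{A}$ be independent
$n$-dimensional discrete random vectors. Let $Y$ and $\tilde{Y}$ be the outputs generated by $(X, A)$ and $(X, \tilde{A})$,
respectively. Then

$$|H(Y) - H(\tilde{Y})| \le cn\,\bar{d}(P_Y, P_{\tilde{Y}}) \tag{61}$$
$$D(P_Y\|P_{\tilde{Y}}) + D(P_{\tilde{Y}}\|P_Y) \le 2cn\,\bar{d}(P_Y, P_{\tilde{Y}}) \tag{62}$$
$$|I(X; Y) - I(X; \tilde{Y})| \le 2cn\,\mathbb{E}[\bar{d}(P_{Y|X}, P_{\tilde{Y}|X})] \tag{63}$$

where

$$c \triangleq \max_{x, a, y, y'}\log\frac{W(y|x, a)}{W(y'|x, a)} \tag{64}$$
$$\mathbb{E}[\bar{d}(P_{Y|X}, P_{\tilde{Y}|X})] \triangleq \sum_x P_X(x)\,\bar{d}(P_{Y|X=x}, P_{\tilde{Y}|X=x}). \tag{65}$$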

³To see this, consider $Q = \mathrm{Bern}(p)^{\otimes n}$ and choose $P$ to be the output distribution of the optimal lossy compressor
for $Q$ at average distortion $\delta n$. By definition, $\bar{d}(P, Q) \le \delta$. On the other hand, $H(P) = n(h(p) - h(\delta) + o(1))$ as
$n \to \infty$ and hence $|H(P) - H(Q)| = n(h(\delta) + o(1))$, which asymptotically meets the upper bound (57) with equality.
Proof. Given any stochastic matrix $G$, define $L(G) \triangleq \max_{x, y, y'}\log\frac{G(y|x)}{G(y'|x)}$. Recall the following
fact from [PV14, Eqn. (58)] about mixtures of product distributions: Let $U$ and $V$ be
$n$-dimensional discrete random vectors connected by a product channel, that is, $P_{V|U} = \prod_{j=1}^n P_{V_j|U_j}$.
Then the mapping $v \mapsto \log P_V(v)$ is $L$-Lipschitz with respect to the Hamming distance, where
$L = \max_{j \in [n]} L(P_{V_j|U_j})$. Consider another pair $(\tilde{U}, \tilde{V})$ connected by the same channel, i.e., $P_{\tilde{V}|\tilde{U}} =
P_{V|U}$. Then Lipschitz continuity implies that

$$\mathbb{E}\left[\left|\log\frac{P_V(V)}{P_V(\tilde{V})}\right|\right] \le L\,\mathbb{E}[d_H(V, \tilde{V})]$$

for any coupling $P_{V\tilde{V}}$. Optimizing over the coupling and in view of (56), we obtain

$$\mathbb{E}\left[\left|\log\frac{P_V(V)}{P_V(\tilde{V})}\right|\right] \le Ln\,\bar{d}(P_V, P_{\tilde{V}}).$$

Repeating the proof of (8)-(10), we have

$$|H(V) - H(\tilde{V})| \le Ln\,\bar{d}(P_V, P_{\tilde{V}}) \tag{66}$$
$$D(P_V\|P_{\tilde{V}}) + D(P_{\tilde{V}}\|P_V) \le 2Ln\,\bar{d}(P_V, P_{\tilde{V}}). \tag{67}$$

Applying (66) and (67) to $V = Y$ and $U = (X, A)$ gives (61) and (62) with $L = c$ defined in (64).
To bound the mutual information, we first notice

$$|I(X; Y) - I(X; \tilde{Y})| \le |H(Y) - H(\tilde{Y})| + |H(Y|X) - H(\tilde{Y}|X)|.$$

Applying (66) conditioned on $X = x$ we get

$$|H(Y|X = x) - H(\tilde{Y}|X = x)| \le c_x n\,\bar{d}(P_{Y|X=x}, P_{\tilde{Y}|X=x}),$$

where $c_x = \max_{j \in [n]}\max_{y, y', a}\log\frac{W(y|x_j, a)}{W(y'|x_j, a)}$. Note that $c_x \le c$ for any $x$; averaging over $P_X$ gives

$$|H(Y|X) - H(\tilde{Y}|X)| \le cn\,\mathbb{E}[\bar{d}(P_{Y|X}, P_{\tilde{Y}|X})]. \tag{68}$$

From the convexity of $(P, Q) \mapsto \bar{d}(P, Q)$, which holds for any Wasserstein distance, we have
$\bar{d}(P_Y, P_{\tilde{Y}}) \le \mathbb{E}[\bar{d}(P_{Y|X}, P_{\tilde{Y}|X})]$, so the right-hand side of (68) also bounds $|H(Y) - H(\tilde{Y})|$ from
above. □


4.2 Marton's transportation inequality

In this section we discuss how the previous bounds (Propositions 8 and 9) in terms of the $\bar{d}$-distance
can be converted to bounds in terms of KL divergence. This is possible when $Q$ is a product
distribution, thanks to Marton's transportation inequality [Mar86, Lemma 1]. We formulate this
together with a few other properties of the $\bar{d}$-distance in the following lemma proved in Appendix A.

Lemma 10.

1. (Marton's transportation inequality [Mar86]): For any pair of distributions $P$ and $Q =
\prod_{i=1}^n Q_i$ on $\mathcal{X}^n$,

$$\bar{d}(P, Q) \le \sqrt{\frac{D(P\|Q)}{2n\log e}}. \tag{69}$$

2. (Tensorization) $\bar{d}\left(\prod_{i=1}^n P_i, \prod_{i=1}^n Q_i\right) \le \frac{1}{n}\sum_{i=1}^n d_{TV}(P_i, Q_i)$.

3. (Contraction) For $P_{XY}$ and $Q_{XY}$ such that $P_{Y|X} = Q_{Y|X} = \prod_{i=1}^n P_{Y_i|X_i}$,

$$\bar{d}(P_Y, Q_Y) \le \max_{i \in [n]}\eta_{TV}(P_{Y_i|X_i})\,\bar{d}(P_X, Q_X), \tag{70}$$

where $\eta_{TV}(W)$ is Dobrushin's contraction coefficient of a Markov kernel $W$ defined as
$\eta_{TV}(W) = \max_{x, x'} d_{TV}(W(\cdot|x), W(\cdot|x'))$.
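The quantities in Lemma 10 are simple to compute for small alphabets. The sketch below, in nats, evaluates the tensorization upper bound for product distributions, the right-hand side of (69), and Dobrushin's coefficient of a finite kernel. When $P$ is itself a product distribution, Pinsker's inequality together with Jensen's inequality shows that the tensorization bound never exceeds the right-hand side of (69), which the example confirms numerically. This is only a consistency check for product $P$, not a verification of (69) in general, and all function names are illustrative.

```python
import math


def total_variation(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def product_dbar_upper(ps, qs):
    """Tensorization bound: (1/n) * sum of per-letter total variations."""
    return sum(total_variation(p, q) for p, q in zip(ps, qs)) / len(ps)


def marton_rhs(kl_total: float, n: int) -> float:
    """Right-hand side of (69) in nats: sqrt(D / (2n))."""
    return math.sqrt(kl_total / (2.0 * n))


def dobrushin_coefficient(kernel):
    """max over input pairs of TV(W(.|x), W(.|x')); kernel[x] is the row W(.|x)."""
    return max(total_variation(r, s) for r in kernel for s in kernel)


if __name__ == "__main__":
    ps = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
    qs = [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]
    kl_total = sum(kl_divergence(p, q) for p, q in zip(ps, qs))
    print(product_dbar_upper(ps, qs), marton_rhs(kl_total, len(ps)))
    print(dobrushin_coefficient([[0.9, 0.1], [0.1, 0.9]]))
```

For a binary symmetric channel with crossover probability $\delta$, the coefficient evaluates to $|1 - 2\delta|$, which is $0.8$ in the example.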

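If we assume that $D(P\|Q) = \epsilon n$ for some small $\epsilon$, then combining (57) and (69) gives

$$|H(P) - H(Q)| \le nF_{\mathcal{X}}\left(\sqrt{\frac{\epsilon}{2\log e}}\right),$$

where the right-hand side behaves as $n\sqrt{\epsilon}\log\frac{1}{\epsilon}$ when $\epsilon \to 0$. This estimate has a one-sided
improvement (here again $Q$ must be a product distribution):

$$H(P) - H(Q) \le \sqrt{\frac{2nD(P\|Q)}{\log e}}\log|\mathcal{X}| \tag{71}$$

(see [CS07] for $n = 1$ and [WV10, Appendix H] for the general case).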

Switching to the setting in Proposition 9, let us consider the case where A has i.i.d. components, 
i.e., P^ = Pq^. Dehne 

t7tv — max dYx{W{-\x, a), W{-\x, a !)), (72) 

a:,a,a' 

which is the maximal Dobrushin contraction coefficients among all channels W{-\-,x) indexed by 
X E A. Then 


diPY^Py) < lE[d(Py|x,T’y|^)] < riTxd{PA,Px) < f?TV1 


I d{Pa\\Pa) 

2n log e 


(73) 


where the left inequality is by convexity of the d-distance as a Wasserstein distance, the middle 
inequality is by Lemma 10, and the right inequality is via (69). An alternative to the estimate (73) 
is the following: 

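\begin{align}
\bar d(P_Y,P_{\tilde Y}) &\le \mathbb{E}[\bar d(P_{Y|X},P_{\tilde Y|X})] \tag{74}\\
&\le \mathbb{E}\left[\sqrt{\frac{1}{2n\log e}D(P_{Y|X}\|P_{\tilde Y|X})}\right] \tag{75}\\
&\le \sqrt{\frac{1}{2n\log e}D(P_{Y|X}\|P_{\tilde Y|X}|P_X)} \tag{76}\\
&\le \sqrt{\frac{\eta_{\rm KL}}{2n\log e}D(P_A\|P_{\tilde A})}, \tag{77}
\end{align}

where (75) is by (69) since $P_{\tilde Y|X=x}$ is a product distribution as $\tilde A$ has a product distribution, (76) is by Jensen's inequality, and (77) is by the tensorization property of the strong data-processing constant for divergence [AG76]:

$$\eta_{\rm KL} \triangleq \max_{x,Q_0}\frac{D\left(\sum_a Q_0(a)W(\cdot|x,a)\,\middle\|\,\sum_a P_0(a)W(\cdot|x,a)\right)}{D(Q_0\|P_0)}.$$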

To conclude, in the regime of $D(P_A\|P_{\tilde A}) \le \epsilon n$ for some small $\epsilon$, our main Proposition 9 yields

$$|H(Y) - H(\tilde Y)| \lesssim n\sqrt{\epsilon}, \tag{78}$$

matching the behavior of (71). However, the estimate (78) is stronger, because a) it is two-sided and b) $P_{\tilde Y}$ can be a mixture of product distributions (since $X$ in Proposition 9 may be arbitrary).
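To make the quantities in (72) and (73) concrete, the following sketch evaluates $\eta_{\rm TV}$ and the right-hand side of (73) for one specific channel. Everything specific here is an assumption chosen for illustration and does not come from the text: the channel is taken to be additive, $Y = x + a + Z \bmod m$, the variable $a$ is assumed to range over the whole alphabet, the noise pmf is hypothetical, and the divergence is measured in nats.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def eta_tv_additive(noise):
    """eta_TV of (72) for Y = x + a + Z mod m with Z ~ noise.

    W(.|x, a) is the noise pmf cyclically shifted by x + a, so the maximum
    over (x, a, a') reduces to a maximum over relative shifts.
    """
    noise = np.asarray(noise, float)
    return max(tv(noise, np.roll(noise, s)) for s in range(len(noise)))

def dbar_bound(eta, D_nats, n):
    """Right-hand side of (73) with D in nats, where log e = 1."""
    return eta * np.sqrt(D_nats / (2 * n))

noise = [0.7, 0.2, 0.1]              # hypothetical non-uniform noise pmf
eta = eta_tv_additive(noise)
n, eps = 1000, 0.01
bound = dbar_bound(eta, eps * n, n)  # D(P_A || P_A~) = eps * n
print(eta, bound)
```

With $D = \epsilon n$ the bound does not depend on $n$ and scales as $\sqrt{\epsilon}$, which is the behavior that feeds into (78).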
4.3 Application: corner points for discrete interference channels

In order to apply Proposition 9 to determine corner points of capacity regions of discrete memoryless 
interference channels (DMIC) we will need an auxiliary tensorization result. This result appears to 
be a rather standard exercise for degraded channels and so we defer the proof to Appendix B. 

Proposition 11. Given channels $P_{A|X}$ and $P_{B|A}$ on finite alphabets, define

$$F_c(t) \triangleq \max\{H(X|A,U) : H(X|B,U) \le t,\ U \to X \to A \to B\}. \tag{79}$$

Then the following hold:

1. (Property of $F_c$) The function $F_c : \mathbb{R}_+ \to \mathbb{R}_+$ is concave, non-decreasing and $F_c(t) \le t$. Furthermore, $F_c(t) < t$ for all $t > 0$, provided that $P_{B|A}$ and $P_{A|X}$ satisfy

$$P_{B|A=a} \not\perp P_{B|A=a'}, \quad \forall a \ne a' \tag{80}$$

and

$$P_{A|X=x} \ne P_{A|X=x'}, \quad \forall x \ne x', \tag{81}$$

respectively.

2. (Tensorization) For any blocklength-$n$ Markov chain $X^n \to A^n \to B^n$, where $P_{A^n|X^n} = P_{A|X}^{\otimes n}$ and $P_{B^n|A^n} = P_{B|A}^{\otimes n}$ are $n$-letter memoryless channels, we have

$$H(X^n|A^n) \le nF_c\left(\frac{1}{n}H(X^n|B^n)\right). \tag{82}$$

Remark 4. Neither of the sufficient conditions (80) and (81) for strict inequality is superfluous, as can be seen from the examples $B = A$ and $A \perp\!\!\!\perp X$, respectively; in both cases $F_c(t) = t$.
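The two conditions are easy to test for a given pair of channel matrices. The sketch below reads (80) as "no two rows of $P_{B|A}$ are mutually singular" and (81) as "no two rows of $P_{A|X}$ coincide"; that reading, the helper names and the example matrices are assumptions made for illustration. The two degenerate examples correspond to the cases mentioned in Remark 4.

```python
import numpy as np

def rows_pairwise_nonorthogonal(W):
    """Condition (80): every two rows put positive mass on a common output."""
    W = np.asarray(W, float)
    k = len(W)
    return all(bool(np.any((W[i] > 0) & (W[j] > 0)))
               for i in range(k) for j in range(i + 1, k))

def rows_pairwise_distinct(W, tol=1e-12):
    """Condition (81): different inputs induce different output pmfs."""
    W = np.asarray(W, float)
    k = len(W)
    return all(bool(np.max(np.abs(W[i] - W[j])) > tol)
               for i in range(k) for j in range(i + 1, k))

def additive_channel(noise):
    """Row a is the pmf of a + Z mod m with Z ~ noise."""
    noise = np.asarray(noise, float)
    return np.stack([np.roll(noise, a) for a in range(len(noise))])

identity = np.eye(3)                            # B = A: rows are orthogonal
independent = np.tile([0.5, 0.3, 0.2], (3, 1))  # A independent of X: rows coincide
noisy = additive_channel([0.7, 0.2, 0.1])       # hypothetical additive channel

print(rows_pairwise_nonorthogonal(identity))    # (80) fails for B = A
print(rows_pairwise_distinct(independent))      # (81) fails for A independent of X
print(rows_pairwise_nonorthogonal(noisy), rows_pairwise_distinct(noisy))
```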

The important consequence of Proposition 11 is the following implication:^ 

Corollary 12. Let $X^n \to A^n \to B^n$, where the memoryless channels $P_{A|X}$ and $P_{B|A}$ of blocklength $n$ satisfy the conditions (80) and (81). Then there exists a continuous function $g : \mathbb{R}_+ \to \mathbb{R}_+$ satisfying $g(0) = 0$, such that for all $n$

$$I(X^n;A^n) \le I(X^n;B^n) + \epsilon n \implies H(X^n) \le I(X^n;B^n) + g(\epsilon)n. \tag{83}$$

Proof. By Proposition 11, we have $F_c(t) < t$ for all $t > 0$. This together with the concavity of $F_c$ implies that $t \mapsto t - F_c(t)$ is convex, strictly increasing and strictly positive on $(0,\infty)$. Define $g$ as the inverse of $t \mapsto t - F_c(t)$, which is increasing and concave and satisfies $g(0) = 0$. Since $I(X^n;A^n) \le I(X^n;B^n) + \epsilon n$, the tensorization result (82) yields

$$H(X^n|B^n) \le H(X^n|A^n) + \epsilon n \le nF_c\left(\frac{1}{n}H(X^n|B^n)\right) + \epsilon n,$$

i.e., $t \le F_c(t) + \epsilon$, where $t = \frac{1}{n}H(X^n|B^n)$. Then $t \le g(\epsilon)$ by definition, so that $H(X^n) = I(X^n;B^n) + H(X^n|B^n) \le I(X^n;B^n) + g(\epsilon)n$, completing the proof. □

^This is the analog of the following property of Gaussian channels, exploited in Theorem 7 in the form of Costa's EPI: For i.i.d. Gaussian $Z$ and $t_1 < t_2 < t_3$ we have

$$I(X;X+t_2Z) = I(X;X+t_3Z) + o(n) \implies I(X;X+t_1Z) = I(X;X+t_3Z) + o(n).$$

This also follows from the concavity of $\gamma \mapsto I(X;\sqrt{\gamma}X+Z)$.
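The construction of $g$ in the proof of Corollary 12 can be mimicked numerically. Since $F_c$ itself is defined through an optimization and is not available in closed form, the sketch below substitutes a purely hypothetical stand-in, `toy_F(t) = t / (1 + t)`, which shares the properties the proof uses (concave, non-decreasing, below $t$ for $t > 0$). It then inverts $t \mapsto t - F(t)$ by bisection. The function names, the stand-in and the search interval are all assumptions.

```python
import math

def toy_F(t):
    """Hypothetical stand-in for F_c: concave, non-decreasing, F(t) < t for t > 0."""
    return t / (1.0 + t)

def g(eps, F=toy_F, hi=1e6, iters=200):
    """Inverse of the increasing map t -> t - F(t) at eps, by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - F(mid) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_closed_form(eps):
    """For toy_F, t - F(t) = t^2 / (1 + t), so g solves t^2 - eps * t - eps = 0."""
    return 0.5 * (eps + math.sqrt(eps * eps + 4.0 * eps))

for eps in (0.0, 0.01, 0.5, 2.0):
    print(eps, g(eps), g_closed_form(eps))
```

For this stand-in, $g(\epsilon) \approx \sqrt{\epsilon}$ for small $\epsilon$; the behavior of the true $g$ depends on the channels and is not claimed here.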
We are now ready to state a non-trivial example of corner points for the capacity region of
DMIC. The proof strategy mirrors that of Theorem 7, with Corollary 6 and Costa’s EPI replaced 
by Proposition 9 and Corollary 12, respectively. 

Theorem 13. Consider the two-user DMIC:

$$Y_1 = X_1, \tag{84}$$

$$Y_2 = X_2 + X_1 + Z_2 \mod 3, \tag{85}$$

where $X_1 \in \{0,1,2\}^n$, $X_2 \in \{0,1\}^n$, $Z_2 \in \{0,1,2\}^n$ are independent and $Z_2 \sim P_2^{\otimes n}$ is i.i.d. for some non-uniform $P_2$ containing no zeros. The maximal rate achievable by user 2 is

$$C_2 = \max_{\mathrm{supp}(Q) \subset \{0,1\}} H(Q * P_2) - H(P_2). \tag{86}$$

At this rate the maximal rate of user 1 is

$$C_1' = \log 3 - \max_{\mathrm{supp}(Q) \subset \{0,1\}} H(Q * P_2). \tag{87}$$

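Remark 5. As an example, consider $P_2 = \left[1-\delta, \frac{\delta}{2}, \frac{\delta}{2}\right]$ where $\delta \ne 0, 1, \frac{2}{3}$. Then the maximum in (86) is achieved by $Q = \left[\frac{1}{2}, \frac{1}{2}\right]$. Therefore $C_2 = H(P_3) - H(P_2)$ and $C_1' = \log 3 - H(P_3)$, where $P_3 = \left[\frac{1}{2}-\frac{\delta}{4}, \frac{1}{2}-\frac{\delta}{4}, \frac{\delta}{2}\right]$. Note that in the case of $\delta = \frac{2}{3}$, where Theorem 13 is not applicable, we simply have $C_2 = 0$ and $C_1' = \log 2$ since $X_2 \perp\!\!\!\perp Y_2$. Therefore the corner point is discontinuous in $\delta$.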
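The formulas (86) and (87) can be evaluated by a direct search over input distributions supported on $\{0,1\}$. The sketch below does this with a grid search, using a noise pmf of the form $[1-\delta, \delta/2, \delta/2]$ with a hypothetical value of $\delta$. The helper names, the grid resolution and the use of bits as the unit are assumptions made for this illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability symbols."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def cyclic_conv(q, p):
    """pmf of U + Z mod m for independent U ~ q and Z ~ p."""
    m = len(p)
    out = np.zeros(m)
    for u in range(m):
        for z in range(m):
            out[(u + z) % m] += q[u] * p[z]
    return out

def corner_point(P2, grid=2001):
    """Grid search over Q = [q, 1 - q, 0]; returns (q*, C2, C1') in bits."""
    best_q, best_H = None, -1.0
    for q in np.linspace(0.0, 1.0, grid):
        H = entropy_bits(cyclic_conv([q, 1.0 - q, 0.0], P2))
        if H > best_H:
            best_q, best_H = q, H
    return best_q, best_H - entropy_bits(P2), np.log2(3) - best_H

delta = 0.3                                # hypothetical value of delta
P2 = [1 - delta, delta / 2, delta / 2]
q_star, C2, C1p = corner_point(P2)
P3 = cyclic_conv([0.5, 0.5, 0.0], P2)      # output pmf under Q = [1/2, 1/2]
print(q_star, C2, C1p, P3)
```

For this family the search returns a maximizer at $q = 1/2$, which is what symmetry suggests: the third coordinate of $Q * P_2$ equals $\delta/2$ for every $q$, and the entropy is largest when the remaining two coordinates are equal.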

Remark 6. Theorem 13 continues to hold even if cost constraints are imposed. Indeed, if $X_2 \in \{0,1,2\}^n$ is required to satisfy

$$\sum_{i=1}^n b(X_{2,i}) \le nB$$

for some cost function $b : \{0,1,2\} \to \mathbb{R}$, then the maximum in (86) and (87) is taken over all $Q$ such that $\mathbb{E}_Q[b(U)] \le B$. Note that taking $B = \infty$ is equivalent to dropping the constraint $X_2 \in \{0,1\}^n$ in (86). In this case, $C_1' = 0$, which can be shown by a simpler argument not involving Proposition 9.

Proof. We start with the converse. Given a sequence of codes with vanishing probability of error and rate pairs $(R_1,R_2)$, where $R_2 = C_2 - \epsilon$, we show that $R_1 \le C_1' + \epsilon'$, where $\epsilon' \to 0$ as $\epsilon \to 0$. Let $Q_2$ be the maximizer of (86), i.e., the capacity-achieving distribution of the channel $X_2 \mapsto X_2 + Z_2$. Let $\tilde X_2 \in \{0,1\}^n$ be distributed according to $Q_2^{\otimes n}$. Then $\tilde X_2 + Z_2 \sim P_3^{\otimes n}$, where $P_3 = Q_2 * P_2$. By Fano's inequality,

\begin{align}
n(C_2 - \epsilon + o(1)) = n(R_2 + o(1)) &= I(X_2;Y_2) \nonumber\\
&\le I(X_2;Y_2|X_1) = I(X_2;X_2+Z_2) \tag{88}\\
&= nC_2 - D(P_{X_2+Z_2}\|P_{\tilde X_2+Z_2}), \nonumber
\end{align}

that is,

$$D(P_{X_2+Z_2}\|P_{\tilde X_2+Z_2}) \le n\epsilon + o(n).$$

Since $P_{\tilde X_2+Z_2} = P_3^{\otimes n}$ is a product distribution, Marton's inequality (69) yields

$$\bar d(P_{X_1+X_2+Z_2},P_{X_1+\tilde X_2+Z_2}) \le \bar d(P_{X_2+Z_2},P_{\tilde X_2+Z_2}) \le \sqrt{\frac{\epsilon}{2\log e}} + o(1).$$






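Applying (63) in Proposition 9 and in view of the translation invariance of the $\bar d$-distance, we obtain

\begin{align*}
|I(X_1;Y_2) - I(X_1;X_1+\tilde X_2+Z_2)| &= |I(X_1;X_1+X_2+Z_2) - I(X_1;X_1+\tilde X_2+Z_2)|\\
&\le 2cn\,\mathbb{E}[\bar d(P_{X_1+X_2+Z_2|X_1},P_{X_1+\tilde X_2+Z_2|X_1})]\\
&= 2cn\,\bar d(P_{X_2+Z_2},P_{\tilde X_2+Z_2})\\
&\le (a\sqrt{\epsilon} + o(1))n,
\end{align*}

where $c = \max_{z,z' \in \{0,1,2\}} \log\frac{P_2(z)}{P_2(z')}$ and $a = c\sqrt{\frac{2}{\log e}}$ are finite since $P_2$ contains no zeros by assumption. On the other hand,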
$$I(X_1;X_1+Z_2) = I(X_1;Y_2|X_2) = I(X_1;Y_2) + I(X_1;X_2|Y_2) = I(X_1;Y_2) + o(n),$$

where $I(X_1;X_2|Y_2) \le H(X_2|Y_2) = o(n)$ by Fano's inequality. Combining the last two displays, we have

$$I(X_1;X_1+Z_2) \le I(X_1;X_1+\tilde X_2+Z_2) + (a\sqrt{\epsilon} + o(1))n.$$

Next we apply Corollary 12, with $X = X_1 \to A = X_1 + Z_2 \to B = A + \tilde X_2$. To verify the conditions, note that the channel $P_{A|X}$ is memoryless and additive with non-uniform noise distribution $P_2$, which satisfies the condition (81). Similarly, the channel $P_{B|A}$ is memoryless and additive with noise distribution $Q_2$, which is the maximizer of (86). Since $P_2$ is not uniform, $Q_2$ is not a point mass. Therefore $P_{B|A}$ satisfies (80). Then Corollary 12 yields

$$nR_1 = H(X_1) \le I(X_1;X_1+\tilde X_2+Z_2) + g(a\sqrt{\epsilon})n \le nC_1' + g(a\sqrt{\epsilon})n + o(n),$$

where the last inequality follows from the fact that $\max_{X_1} I(X_1;X_1+\tilde X_2+Z_2) = nC_1'$, attained by $X_1$ uniform on $\{0,1,2\}^n$.

Finally, note that the rate pair $(C_1',C_2)$ is achievable by a random MAC-code for $(X_1,X_2) \to Y_2$, with $X_1$ uniform on $\{0,1,2\}^n$ and $X_2 \sim Q_2^{\otimes n}$. □
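The single-letter fact used in the last step of the converse, namely that the mutual information across the additive channel with noise $P_3 = Q_2 * P_2$ is maximized by a uniform input and then equals $\log 3 - H(P_3)$, can be checked numerically. In the sketch below the noise pmf `P2` is a hypothetical example, `Q2 = [1/2, 1/2, 0]` is assumed to be the maximizer of (86) for it (which holds by symmetry for this particular `P2`), and all helper names are illustrative.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability symbols."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def additive_channel(noise):
    """Row x is the pmf of x + N mod m with N ~ noise."""
    noise = np.asarray(noise, float)
    return np.stack([np.roll(noise, x) for x in range(len(noise))])

def mutual_information_bits(px, W):
    """I(X; Y) = H(Y) - H(Y|X) for input pmf px and row-stochastic W."""
    px, W = np.asarray(px, float), np.asarray(W, float)
    h_cond = sum(px[x] * entropy_bits(W[x]) for x in range(len(px)))
    return entropy_bits(px @ W) - h_cond

P2 = np.array([0.7, 0.15, 0.15])          # hypothetical noise pmf
Q2 = np.array([0.5, 0.5, 0.0])            # assumed maximizer of (86) for this P2
P3 = Q2[0] * P2 + Q2[1] * np.roll(P2, 1)  # P3 = Q2 * P2 (cyclic convolution)
W = additive_channel(P3)

I_uniform = mutual_information_bits(np.ones(3) / 3, W)
C1_prime = np.log2(3) - entropy_bits(P3)

rng = np.random.default_rng(1)
I_random = [mutual_information_bits(rng.dirichlet(np.ones(3)), W)
            for _ in range(200)]
print(I_uniform, C1_prime, max(I_random))
```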
A Proof of Lemma 10

Proof. To prove the tensorization inequality, let $(X,Y) = (X_j,Y_j)_{j=1}^n$ have independent coordinates, with each pair $(X_j,Y_j)$ distributed as the optimal coupling of $(P_j,Q_j)$. Then $\mathbb{E}[d_H(X,Y)] = \sum_{j=1}^n \mathbb{P}[X_j \ne Y_j] = \sum_{j=1}^n d_{\rm TV}(P_j,Q_j)$.
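The coupling used in this argument can be written out explicitly. The sketch below builds one standard maximal coupling of two pmfs (common mass on the diagonal, leftover mass spread off the diagonal) and then sums the per-coordinate mismatch probabilities. The function names and the example pmfs are assumptions for illustration, and this is just one of several constructions of an optimal total-variation coupling.

```python
import numpy as np

def maximal_coupling(p, q):
    """Joint pmf J with marginals p and q such that P[X != Y] = TV(p, q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    common = np.minimum(p, q)
    J = np.diag(common)                  # mass on which X = Y
    tv = 1.0 - common.sum()
    if tv > 0:
        # Leftover mass is spread off the diagonal as a product measure.
        J = J + np.outer(p - common, q - common) / tv
    return J

def expected_hamming(pairs):
    """E[d_H(X, Y)] when each coordinate pair is coupled independently and maximally."""
    total = 0.0
    for p, q in pairs:
        J = maximal_coupling(p, q)
        total += J.sum() - np.trace(J)   # P[X_j != Y_j]
    return total

pairs = [([0.5, 0.5], [0.9, 0.1]),
         ([0.2, 0.3, 0.5], [0.3, 0.3, 0.4])]
print(expected_hamming(pairs))           # sum of the two TV distances
```

Dividing the result by the number of coordinates gives the upper bound on $\bar d$ stated in the tensorization property.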







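To show (70), let $\pi_{XY\tilde X\tilde Y}$ be an arbitrary coupling of $P_{XY}$ and $Q_{XY}$ so that $(X,\tilde X)$ is distributed according to the optimal coupling of $\bar d(P_X,Q_X)$, that is, $\mathbb{E}_\pi[d_H(X,\tilde X)] = n\bar d(P_X,Q_X)$. By the first inequality we just proved, for any $x, x' \in \mathcal{X}^n$,

$$\bar d(P_{Y|X=x},P_{Y|X=x'}) \le \frac{1}{n}\sum_{i=1}^n d_{\rm TV}(P_{Y_i|X_i=x_i},P_{Y_i|X_i=x_i'}) \le \frac{1}{n}\sum_{i=1}^n \eta_{\rm TV}(P_{Y_i|X_i})\mathbf{1}\{x_i \ne x_i'\} \le \frac{\eta}{n}d_H(x,x'),$$

where $\eta = \max_{i\in[n]} \eta_{\rm TV}(P_{Y_i|X_i})$ and the middle inequality follows from the definition of Dobrushin's contraction coefficient. Applying Dobrushin's contraction [Dob70] (see [PW16, Proposition 18], with $\rho = \frac{1}{n}d_H$ and $r = \eta\rho$), there exists a coupling $\pi'$ of $P_{XY}$ and $Q_{XY}$, so that $\pi'_{X\tilde X} = \pi_{X\tilde X}$ and

$$\mathbb{E}_{\pi'}[d_H(Y,\tilde Y)] \le \eta\,\mathbb{E}_\pi[d_H(X,\tilde X)] = n\eta\,\bar d(P_X,Q_X),$$

concluding the proof. □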


B Proof of Proposition 11 

Proof. Basic properties of $F_c$ follow from standard arguments. To show the strict inequality $F_c(t) < t$ under the conditions (80) and (81), we first notice that $F_c$ is simply the concave envelope of the set of achievable pairs $(H(X|A),H(X|B))$ obtained by iterating over all $P_X$. By Carathéodory's theorem, it is sufficient to consider a ternary-valued $U$ in the optimization defining $F_c(t)$. Then the set of achievable pairs $(H(X|A,U),H(X|B,U))$ is convex and compact (as the continuous image of the compact set of distributions $P_{U,X}$). Consequently, to have $F_c(t) = t$ there must exist a distribution $P_{U,X}$ such that

$$H(X|A,U) = H(X|B,U) = t. \tag{90}$$

We next show that under the extra conditions on $P_{B|A}$ and $P_{A|X}$ we must have $t = 0$. Indeed, (80) guarantees that the channel $P_{B|A}$ satisfies the strong data processing inequality (see, e.g., [CK11, Exercise 15.12 (b)] and [PW16, Section 1.2] for a survey), namely that there exists $\eta < 1$ such that

$$I(X;B|U) \le \eta I(X;A|U). \tag{91}$$

From (90) and (91) we infer that $I(X;A|U) = 0$, or equivalently

$$D(P_{A|X}\|P_{A|U}|P_{U,X}) = 0.$$

On the other hand, the condition (81) ensures that then we must have $H(X|U) = 0$. Clearly, this implies $t = 0$ in (90).

To show the single-letterization statement (82), we only consider the case of $n = 2$ since the generalization is straightforward by induction. Let $X^2 \to A^2 \to B^2$ be a Markov chain with blocklength-2 memoryless channel in between. We have

\begin{align}
H(X^2|B^2) &= H(X_1|B^2) + H(X_2|B^2,X_1) \tag{92}\\
&= H(X_1|B^2) + H(X_2|B_2,X_1) \tag{93}\\
&\ge H(X_1|B_1,A_2) + H(X_2|B_2,X_1) \tag{94}
\end{align}

where (93) is because $B_2 \to X_2 \to X_1 \to B_1$ and hence $I(X_2;B_1|X_1,B_2) = 0$, and (94) is because $B_1 \to X_1 \to A_2 \to B_2$. Next consider the chain

\begin{align}
H(X^2|A^2) &= H(X_1|A^2) + H(X_2|A^2,X_1) \tag{95}\\
&= H(X_1|A^2) + H(X_2|A_2,X_1) \tag{96}\\
&\le F_c(H(X_1|B_1,A_2)) + F_c(H(X_2|B_2,X_1)) \tag{97}\\
&\le 2F_c\left(\frac{1}{2}H(X_1|B_1,A_2) + \frac{1}{2}H(X_2|B_2,X_1)\right) \tag{98}\\
&\le 2F_c\left(\frac{1}{2}H(X^2|B^2)\right) \tag{99}
\end{align}

where (96) is by $A_2 \to X_2 \to X_1 \to A_1$ and hence $I(X_2;A_1|X_1,A_2) = 0$, (97) is by the definition of $F_c$ and since we have both $A_2 \to X_1 \to A_1 \to B_1$ and $X_1 \to X_2 \to A_2 \to B_2$, (98) is by the concavity of $F_c$, and finally (99) is by the monotonicity of $F_c$ and (94). □